IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



Appellant:      Bin ZHANG                           §   Confirmation No.:  5792
Serial No.:     10/694,367                          §   Group Art Unit:    2168
Filed:          10/27/2003                          §   Examiner:          J. A. Morrison
For:            Data Mining Method And System       §   Docket No.:        200310832-1
                Using Regression Clustering         §

SUPPLEMENTAL BRIEF TO REINSTATE APPEAL 

Mail Stop Appeal Brief - Patents                Date: September 17, 2007
Commissioner for Patents
PO Box 1450
Alexandria, VA 22313-1450
Sir: 

Appellant hereby submits this Supplemental Brief to Reinstate Appeal in 
response to the Office action dated July 16, 2007 in connection with the
above-identified application.



207387.02/2162.15800 



Page 1 of 28 



HP PDNO 200310832-1 



Appl. No. 10/694,367 

Supplemental Brief dated September 17, 2007 
Reply to Office action of July 16, 2007 



TABLE OF CONTENTS 

I. REAL PARTY IN INTEREST 3 

II. RELATED APPEALS AND INTERFERENCES 4 

III. STATUS OF THE CLAIMS 5 

IV. STATUS OF THE AMENDMENTS 6 

V. SUMMARY OF THE CLAIMED SUBJECT MATTER 7 

VI. GROUNDS OF REJECTION TO BE REVIEWED ON APPEAL 9 

VII. ARGUMENT 10 

A. The anticipation rejection of claim 24 10

B. The obviousness rejections 10

1. Claims 1, 15, 17 and 28-30 10 

2. Claims 18-22 11 

3. Claims 25-27 11 

C. Conclusion 12 

VIII. CLAIMS APPENDIX 13

IX. EVIDENCE APPENDIX 20 

X. RELATED PROCEEDINGS APPENDIX 28 




I. REAL PARTY IN INTEREST 

The real party in interest is the Hewlett-Packard Development Company 
(HPDC), a Texas Limited Partnership, having its principal place of business in 
Houston, Texas. The Assignment from the inventor to HPDC was recorded on 
October 27, 2003, at Reel/Frame 014652/0977.

II. RELATED APPEALS AND INTERFERENCES

Appellant is unaware of any related appeals or interferences.

III. STATUS OF THE CLAIMS

Originally filed claims: 1-30.

Claim cancellations: None.

Added claims: None.

Presently pending claims: 1-30.

Presently appealed claims: 1, 15, 17-22, 24-27, and 30.

The Examiner concluded that the remaining pending claims are allowed or 
allowable if rewritten in independent form and thus are not being appealed.

IV. STATUS OF THE AMENDMENTS

Appellant attempted to amend various claims after the Final Office Action 
dated October 17, 2006, but the Examiner did not enter the amendments.

V. SUMMARY OF THE CLAIMED SUBJECT MATTER

With the increase in the amount of data being stored in databases, the 
need to efficiently and accurately analyze data is increasing. Appellant's 
disclosure, para. [0002]. Appellant's contribution relates to techniques for 
efficiently mining data from datasets distributed across multiple locations. 

According to the invention of claim 1, a processor-based method 
comprises selecting a set number of functions correlating variable parameters of 
a dataset. See e.g., Fig. 2, ref. no. 30 and para. [0025]. The method further 
comprises clustering the dataset by iteratively applying a regression algorithm 
and a K-Harmonic Means performance function on the set number of functions to 
determine a pattern in said dataset. See e.g., Fig. 2 and paras. [0025]-[0030]. 

According to the invention of claim 15, a system comprises an input port 
configured to receive data and a processor configured to regress functions 
correlating variable parameters of a set of the data, cluster the functions using a 
K-Harmonic Mean performance function, and repeat the regressing and 
clustering sequentially to thereby determine a pattern in the dataset. See e.g., 
Fig. 2 and paras. [0025]-[0030]. 

According to the invention of claim 18, a system comprises a plurality of 
data sources and a means for regressively clustering datapoints from the plurality 
of data sources without transferring data between the plurality of data sources to 
thereby determine a pattern in data contained in said data sources. See e.g., 
Fig. 2 and paras. [0025]-[0030]. 

According to the invention of claim 24, a system comprises a plurality of 
data sources each having a processor configured to access datapoints within the 
respective data source and a central station coupled to the plurality of data 
sources and comprising a processor. The processors of the central station and 
plurality of data sources are collectively configured to mine the datapoints of the 
data sources as a whole without transferring all of the datapoints between the 
data sources and the central station to thereby determine a pattern in datapoints 
contained in the data sources. See e.g., Fig. 2 and paras. [0025]-[0030].

According to the invention of claim 28, a processor-based method for
mining data comprises independently applying a regression clustering algorithm 
to a plurality of distributed datasets and developing matrices from probability and 
weighting factors computed from the regression clustering algorithm. The 
matrices individually represent the distributed datasets without including all 
datapoints within the datasets. The method further comprises determining global 
coefficient vectors from a composite of the matrices and multiplying functions 
correlating similar variable parameters of the distributed datasets by the global 
coefficient vectors to thereby determine a pattern in the datasets. See e.g., Fig. 2 
and paras. [0025]-[0030].

VI. GROUNDS OF REJECTION TO BE REVIEWED ON APPEAL

Whether claim 24 is anticipated by Zhang et al. ("K-Harmonic Means-A 
Data Clustering Algorithm," hereinafter the "Zhang Reference"). 

Whether claims 1, 15, 17-22, 25-28, and 30 are obvious over the Zhang
Reference in view of U.S. Pat. Pub. No. 2003/0145000 ("Arning").

VII. ARGUMENT

A. The anticipation rejection of claim 24 

With regard to claim 24, the Examiner's Final Office Action quoted the 
claim language and simply pointed to page 1 of the Zhang reference. 
Independent claim 24 requires a plurality of data sources and a central station. 
Each of the plurality of data sources and the central station comprises a processor.
The claim further requires that the processors of the data sources and the central 
station "are collectively configured to mine the datapoints of the data sources as a 
whole without transferring all of the datapoints between the data sources and the 
central station." The Appellant has reviewed page 1 of the Zhang Reference, as 
well as the rest of the document, and simply does not find a teaching of this 
combination of limitations. 

Appellant made the argument above in his prior Appeal Brief. However, 
the Examiner did not seem to address this argument in the subsequent Office 
Action of July 16, 2007. Appellant respectfully requests allowance of claim 24, or 
at least an explanation as to why the Examiner disagrees with the argument. 

B. The obviousness rejections 

1. Claims 1, 15, 17 and 28-30

Appellant selects claim 1 as representative of this claim grouping for 
purpose of the following argument. Claim 1 requires "iteratively applying a 
regression algorithm and a K-Harmonic Means performance function on the set 
number of functions to determine a pattern in said dataset." Claim 1 thus requires 
the combination of "regression" with K-Harmonic Means. The Examiner now 
concedes that the Zhang Reference does not disclose "regression" (Office Action 
of July 16, 2007 p. 4). Instead, the Examiner turns to Arning and believes that the
claimed combination of regression with K-Harmonic Means is obvious. 

It is possible to combine regression with a number of different algorithms 
such as K-Means, expectation maximization (EM) or, as conceived by Appellants, 
K-Harmonic Means. To the extent that the Examiner believes the combination of 
regression with K-Harmonic Means is obvious, given the other possible choices, 
the Examiner's argument is clearly, and improperly, based in hindsight gleaned
from Appellants. See e.g., In re Dembiczak, 175 F.3d 994, 999 (Fed. Cir. 1999)
(reversing the Examiner and precluding the PTO from falling victim to the 
"insidious effect of a hindsight syndrome wherein that which only the inventor 
taught is used against its teacher"). 

Further, Appellants attach a publication, authored by the inventor, entitled
"Regression Clustering" (Bin Zhang, Proceedings of the 3rd IEEE International
Conference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne,
Florida, pp. 451-458) that shows that the regression/K-Harmonic Means
combination works better than the combination of regression with either K-Means
or EM. See section 12 Conclusions. The inventor determined that K-Harmonic
Means-based regression is better than certain other types of regression (e.g.,
K-Means and EM-based regression). As a result, the inventor filed the present
application to cover, at least in part, K-Harmonic Means-based regression. The
Examiner is now improperly using the inventor's own hard work and teachings
against the claims.

Based on the foregoing, Appellant respectfully submits that the rejections 
of the claims in this grouping be reversed, and the claims set for issue. 

2. Claims 18-22 

Appellant selects claim 18 as representative of this grouping. Claim 18 
requires "regressively clustering datapoints." Neither the Zhang Reference nor
Arning teaches regression clustering as required by claim 18. Further, there is no
motivation to combine the Zhang Reference and Arning, absent the hindsight of 
Appellant's own teachings, which is improper. Based on the foregoing, Appellant 
respectfully submits that the rejections of the claims in this grouping be reversed, 
and the claims set for issue. 

3. Claims 25-27 

Claims 25-27 depend from claim 24, which is allowable over the Zhang 
Reference as explained above. The Examiner's rejection based on the 
combination of the Zhang Reference and Arning is improper as explained 
previously. Accordingly, the Examiner's rejection of claims 25-27 is in error.

C. Conclusion

For the reasons stated above, Appellant respectfully submits that the 
Examiner erred in rejecting all pending claims. It is believed that no extensions 
of time or fees are required, beyond those that may otherwise be provided for in 
documents accompanying this paper. However, in the event that additional 
extensions of time are necessary to allow consideration of this paper, such 
extensions are hereby petitioned under 37 C.F.R. § 1.136(a), and any fees 
required (including fees for net addition of claims) are hereby authorized to be 
charged to Hewlett-Packard Development Company's Deposit Account No. 08-2025.

Respectfully submitted, 

/Jonathan M. Harris/ 

Jonathan M. Harris 
PTO Reg. No. 44,144 
CONLEY ROSE, P.C. 
(713) 238-8000 (Phone) 
(713) 238-8008 (Fax) 
ATTORNEY FOR APPELLANT 

HEWLETT-PACKARD COMPANY 

Intellectual Property Administration 

Legal Dept., M/S 35 

P.O. Box 272400 

Fort Collins, CO 80527-2400

VIII. CLAIMS APPENDIX

1. (Previously presented) A processor-based method, comprising:

selecting a set number of functions correlating variable parameters of a
dataset; and

clustering the dataset by iteratively applying a regression algorithm and a
K-Harmonic Means performance function on the set number of
functions to determine a pattern in said dataset.

2. (Original) The processor-based method of claim 1, wherein said clustering
comprises:

determining distances between datapoints of the dataset and values
correlated with the set number of functions;

regressing the set number of functions using datapoint probability and
weighting factors associated with the determined distances;

calculating a difference of harmonic averages for the distances determined
prior to and subsequent to said regressing; and

repeating said regressing, determining and calculating upon determining
the difference of harmonic averages is greater than a
predetermined value.

3. (Original) The processor-based method of claim 2, wherein said 
determining the distances comprises determining distances from each datapoint 
of the dataset to values within each function of the set number of functions. 

4. (Original) The processor-based method of claim 2, wherein said selecting 
and said clustering are conducted for a plurality of datasets each from a different 
data source. 

5. (Original) The processor-based method of claim 4, wherein said selecting 
and said clustering are conducted in parallel for each of the plurality of datasets.

6. (Original) The processor-based method of claim 4, further comprising
determining a common coefficient vector to compensate for variations between 
similar sets of functions within the different data sources. 

7. (Original) The processor-based method of claim 6, wherein said 
determining the common coefficient vector comprises: 

developing matrices from the dataset datapoints and the probability and 
weighting factors for each of the datasets prior to said reiterating; 
and 

determining the common coefficient vector from a composite of the 
developed matrices. 

8. (Original) The processor-based method of claim 7, further comprising 
multiplying the similar sets of functions within the different data sources by the 
common coefficient vector. 

9. (Previously presented) A storage medium comprising program 
instructions executable by a processor for: 

selecting a set number of functions correlating variable parameters of a 
dataset; 

determining distances between datapoints of the dataset and values
correlated with the set number of functions;

calculating harmonic averages of the distances;

regressing the set number of functions using datapoint probability and 
weighting factors associated with the determined distances; 

repeating said determining and calculating for the regressed set of 
functions; 

computing a change in harmonic averages for the set number of functions 
prior to and subsequent to said regressing; and

reiterating said regressing, repeating and computing upon determining the
change in harmonic averages is greater than a predetermined value 
to thereby determine a pattern in said dataset. 

10. (Original) The storage medium of claim 9, wherein the program 
instructions are executable using a processor for computing the datapoint 
probability and weighting factors. 

11. (Original) The storage medium of claim 9, wherein the program 
instructions are executable using a processor for developing matrices from the 
dataset datapoints and the probability and weighting factors prior to said 
reiterating. 

12. (Original) The storage medium of claim 11, wherein the program 
instructions are executable using a processor for amassing matrices developed 
from a plurality of datasets each from a different data source. 

13. (Original) The storage medium of claim 11, wherein the program 
instructions are executable using a processor for determining a common 
coefficient vector from the composite of matrices. 

14. (Original) The method of claim 13, wherein the program instructions are 
executable using a processor for multiplying similar sets of functions within the 
different data sources by the common coefficient vector. 

15. (Previously presented) A system, comprising: 
an input port configured to receive data; and 

a processor configured to: 

regress functions correlating variable parameters of a set of the 
data;

cluster the functions using a K-Harmonic Mean performance
function; and 

repeat said regress and cluster sequentially to thereby determine a 
pattern in said set of data. 

16. (Original) The system of claim 15, wherein the processor is arranged 
within one of a plurality of data sources each comprising a processor configured 
to: 

regress the functions on a dataset of the respective data source; 

cluster the functions using a K-Harmonic Mean performance function; and 

repeat said regress and cluster sequentially. 

17. (Original) The system of claim 15, further comprising a central station 
coupled to the plurality of data sources, wherein the central station comprises a 
processor configured to compute common coefficient vectors which compensate 
for variations between the regressively clustered functions representing the 
datasets, and wherein each of the processors of the data sources is configured to 
alter the functions by the common coefficient vectors. 

18. (Previously presented) A system, comprising: 
a plurality of data sources; and 

a means for regressively clustering datapoints from the plurality of data 
sources without transferring data between the plurality of data 
sources to thereby determine a pattern in data contained in said 
data sources. 

19. (Original) The system of claim 18, wherein the means for regressively 
clustering the datasets comprises a means for applying a regression algorithm 
and a K-Harmonic Means performance function on the datasets. 




20. (Original) The system of claim 18, wherein the means for regressively 
clustering the datasets comprises a means for applying a regression algorithm 
and a K-Means performance function on the datasets. 

21. (Original) The system of claim 18, wherein the means for regressively 
clustering the datasets comprises a means for applying a regression algorithm 
and an Expectation Maximization performance function on the datasets. 

22. (Original) The system of claim 18, further comprising a central station 
communicably coupled to the plurality of data sources, wherein the means is 
further for: 

collecting dataset information at the central station from the plurality of 
data sources; 

determining a common coefficient vector from the collected dataset 
information; and 

altering datasets within the plurality of data sources by the common 
coefficient vector. 

23. (Original) The system of claim 18, wherein the means for regressively 
clustering the datasets comprises a storage medium with program instructions 
executable using a processor for: 

selecting a set number of functions correlating variable parameters of a 
dataset; 

determining distances between datapoints of the dataset and values
correlated with the set number of functions;

regressing the set number of functions using datapoint probability and
weighting factors associated with the determined distances;

calculating a difference of harmonic averages for the distances determined
prior to and subsequent to said regressing; and

reiterating said regressing, determining and calculating upon determining
the difference of harmonic averages is less than a predetermined
value.

24. (Previously presented) A system, comprising: 

a plurality of data sources each having a processor configured to access 
datapoints within the respective data source; and 

a central station coupled to the plurality of data sources and comprising a 
processor, wherein the processors of the central station and 
plurality of data sources are collectively configured to mine the 
datapoints of the data sources as a whole without transferring all of 
the datapoints between the data sources and the central station to 
thereby determine a pattern in datapoints contained in said data 
sources. 

25. (Original) The system of claim 24, wherein the each of the processors 
within the plurality of data sources is configured to regressively cluster a dataset 
within the respective data source. 

26. (Original) The system of claim 25, wherein the processor within the central 
station is configured to: 

collect information pertaining to the regressively clustered datasets; and 
based upon the collected information, calculate common coefficient 
vectors which balance variations between functions correlating 
similar variable parameters of the regressively clustered datasets. 

27. (Original) The system of claim 26, wherein the processor within the central 
station is further configured to: 

compute a residual error from the common coefficient vectors; 




propagate the common coefficient vectors to the data sources upon 
computing a residual error value greater than a predetermined 
value; and 

send a message to the data sources to terminate the regression clustering 
of the datasets upon computing a residual error value less than a 
predetermined value. 

28. (Previously presented) A processor-based method for mining data, 
comprising: 

independently applying a regression clustering algorithm to a plurality of
distributed datasets;

developing matrices from probability and weighting factors computed from
the regression clustering algorithm, wherein the matrices
individually represent the distributed datasets without including all
datapoints within the datasets;

determining global coefficient vectors from a composite of the matrices;
and

multiplying functions correlating similar variable parameters of the
distributed datasets by the global coefficient vectors to thereby
determine a pattern in said datasets.

29. (Original) The processor-based method of claim 28, further comprising 
repeating said independently applying, said developing, said determining and 
said multiplying. 

30. (Original) The processor-based method of claim 28, further comprising 
calculating a residue error associated with the global coefficients prior to said 
multiplying.

IX. EVIDENCE APPENDIX



Regression Clustering 

Bin Zhang 

Hewlett-Packard Research Laboratories, Palo Alto, CA 94304 
bzhang@hpl.hp.com 



Abstract 

Complex distribution in real-world data is often 
modeled by a mixture of simpler distributions. Clustering 
is one of the tools to reveal the structure of this mixture. 
The same is true of the datasets with chosen response
variables that people run regression on. Without 
separating the clusters with very different response 
properties, the residue error of the regression is large. 
Input variable selection could also be misguided to a 
higher complexity by the mixture. In Regression 
Clustering (RC), K (>1) regression functions are applied 
to the dataset simultaneously which guide the clustering 
of the dataset into K subsets each with a simpler 
distribution matching its guiding function. Each function
is regressed on its own subset of data with a much 
smaller residue error. Both the regressions and the 
clustering optimize a common objective function. We 
present an RC algorithm based on the K-Harmonic Means
clustering algorithm and compare it with other existing 
RC algorithms based on K-Means and EM. 



1. Introduction 

Two important data mining techniques are regression 
on the datasets with chosen response variables, and 
clustering on the datasets that do not have response 
information. The RC algorithm handles the case in
between: datasets that have response variables but
do not contain enough information to guarantee
high-quality learning; the missing part of the response is
essential. Missing information is generally caused by
insufficiently controlled data collection, due to a lack of 
means, a lack of understanding or other reasons. For 
example, sales or marketing data collected on all 
customers may not have the label on a proper 
segmentation of the customers. 

Clustering algorithms partition (hard or soft) a 
dataset into a finite number of subsets each containing 
similar data points. Dissimilarity labeled by the index of 
the partitions provides additional supervision to the K 
regressions running in parallel so that each regression 
works on a subset of similar data. The K regressions in
turn provide the model of dissimilarity for clustering to
partition the data. The linkage is a common objective 
function minimized by both the regressions and the 
clustering. Neither can be properly performed alone. 

The concept of regression clustering is not new. A 
number of earlier papers are reviewed in the next section. 
This paper adds a new member, Regression-K-Harmonic
Means clustering, to the family of RC algorithms and 
compares its performance with others. 

1.1. Related Previous Work 

Regression clustering has been studied under a 
number of different names: Clusterwise Linear 
Regression in Späth [14-17], DeSarbo and Cron [2],
Hennig [6-8] and others; Trajectory clustering using
mixtures of regression models by Gaffney and Smyth [4];
Fitting Regression Model to Finite Mixtures by Williams 
[20]; Clustered Partial Linear Regression by Torgo [19]. 
We choose the name Regression-Clustering because a) 
RC is not limited to linear regressions; b) Comparing RC 
with center-based clustering algorithms, KM, KHM, and 
EM, the centers are replaced by regression functions — 
RCs are just regression-function-centered clustering 
algorithms; c) By examining the computational structure, 
the clustering algorithm represents the main (outer) loop 
or the overall program structure, and the regression is 
called only as a subroutine to update the "centers". 

Clusterwise Linear Regression by Späth [14-17] used
linear regression and partition of the dataset in his
algorithm that locally minimizes the total mean square
error over all K-regression (Eq. (2)). He also developed
an incremental version to allow adding new observations
into the dataset. Späth's algorithm is based on the K-Means
clustering algorithm. DeSarbo [2] used maximum likelihood
methodology for performing clusterwise linear
regression, locally minimizing the objective function (Eq. 
(16)). A marketing application is presented in his paper. 
We will briefly introduce the details of his work in 
section 6 for comparison. Hennig continued the research 
of Clustered Linear Regression using the same linear 
mixing of Gaussian density functions. The number of 
clusters in his work is treated as unknown. Gaffney and
Smyth's work [4] is also based on the KM clustering
algorithm. 

1.2. Contributions of This Paper 

Previous work on RC used K-Means and EM in their
algorithms. These RC algorithms have the same well-known
problem of being sensitive to the initialization of
the regression functions, just as K-Means and EM are
sensitive to the initialization of the centers.

The author developed a center-based clustering
algorithm, K-Harmonic Means, which is much less
sensitive to initialization of centers. It is demonstrated
through a large number of experiments on randomly
generated datasets that KHM converges to a better local
optimum than K-Means and EM, measured by a common
objective function of K-Means (Zhang [23][24]).

In this paper, we add a new algorithm RC-KHM to
the family of RC algorithms (Section 4 and 5), provide
performance comparisons of the three RC algorithms
based on extensive experimental results (Section 11 and
Fig. 5), and give an interpretation of the K-regression
functions as a predictor and its combination with a K-way
classifier (Section 10).

The rest of the paper is organized as follows:
Section 2, defining the problem; Sections 3 and 4, the
RC-KM and its special case LinReg-KM; Sections 5 and 6,
the new RC-K-Harmonic Means and its special case
LinReg-KHM; Sections 7 and 8, the RC-Expectation
Maximization algorithm and LinReg-EM; Section 9,
computational costs; Section 10, probability
interpretation of the K regression functions as predictors;
Section 11, experimental results and comparisons;
Section 12, conclusions.

2. The Problem 

Given a dataset with supervising responses,
$Z = (X, Y) = \{(x_i, y_i) \mid i = 1, \dots, N\}$, a (constrained) family
of functions $\Phi = \{f\}$ and a loss function $e() \ge 0$,
regression solves the following minimization problem,

$$f^{opt} = \arg\min_{f \in \Phi} \sum_{i=1}^{N} e(f(x_i), y_i). \quad (1)$$

Usually, $\Phi = \{\sum_{l=1}^{m} \beta_l h(x, a_l) \mid \beta_l \in R, a_l \in R^n\}$, linear
expansions of simple parametric functions such as
polynomials of degree up to m, Fourier series of bounded
frequency, neural networks, RBF, .... Usually,
$e(f(x), y) = \|f(x) - y\|^p$, with p = 1, 2 most widely
used. (1) is not effective when the dataset contains a
mixture of very different response characteristics as
shown in Fig. 1a. It is much better to find the partitions
in the data and learn a separate function on each partition
as shown in Fig. 1b.




Fig. 1: a) Left: a single function is regressed on 
all training data which is a mixture of three 
different distributions. b) Right: three 
regression functions, each regressed on a 
subset found by RC. The residue errors are 
much smaller. 
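
The contrast described in Fig. 1 can be checked numerically. The following is a minimal sketch, assuming a synthetic mixture of two hypothetical straight lines (y = 2x and y = 10 - x) and a one-variable least-squares fit; the dataset and the helper names `fit_line` and `sse` are illustrative assumptions and do not come from the paper.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b


def sse(points, line):
    """Sum of squared residuals of the line (a, b) on the given points."""
    a, b = line
    return sum((a * x + b - y) ** 2 for x, y in points)


# Hypothetical mixture: two subsets with very different responses.
part_a = [(x, 2.0 * x) for x in range(10)]
part_b = [(x, 10.0 - x) for x in range(10)]
mixture = part_a + part_b

# One function regressed on all data (the situation of Fig. 1a).
single_error = sse(mixture, fit_line(mixture))

# One function per subset (the situation of Fig. 1b), with the
# partition assumed known here; RC has to discover it.
split_error = sse(part_a, fit_line(part_a)) + sse(part_b, fit_line(part_b))
```

On this toy data the single fit leaves a large residue error while the per-subset fits leave essentially none, which is the motivation for learning the partition and the functions together.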

We assume that there are K partitions in the data. 
Determining the right K has been discussed in the 
clustering context [5][18], which still applies under our 
new setting. K can also be determined (or bounded) by 
other aspects of the original problem. 

In RC algorithms, K regression functions $M = \{f_1, \dots, f_K\} \subset \Phi$
are applied to the data, each of which
finds a partition $Z_k$ and regresses on it. Both parts of the
process, the K regressions and the partitioning of the
dataset, optimize a common objective function. The
partition of the dataset can be a "soft" partition given by 
K density functions defined on the dataset. 

3. The RC-KM Algorithm 

Clusterwise Linear Regression [14] is the simplest
RC algorithm. We review it as an introduction to RC. 
The K regressions do not have to be linear. 

RC-KM solves the following optimization problem,

$$\min_{\{f_k\} \subset \Phi;\ \{Z_k\}} Perf_{RC-KM} = \sum_{k=1}^{K} \sum_{(x_i, y_i) \in Z_k} e(f_k(x_i), y_i) \quad (2)$$

where $Z = \bigcup_{k=1}^{K} Z_k$ ($Z_k \cap Z_{k'} = \emptyset$, $k \ne k'$). The optimization
is over both the K regression functions and the partition.
The optimal partition will satisfy

$$Z_k = \{(x, y) \in Z \mid e(f_k^{opt}(x), y) \le e(f_{k'}^{opt}(x), y) \ \forall k' \ne k\}, \quad (3)$$

which allows us to replace the function in (2) by

$$Perf_{RC-KM}(Z, \{f_k\}_{k=1}^{K}) = \sum_{i=1}^{N} MIN\{e(f_k(x_i), y_i) \mid k = 1, \dots, K\}. \quad (4)$$

RC-KM Algorithm, a monotone-convergent algorithm to
find a local optimum of (2):

Step 1: Pick K functions $f_1^{(0)}, \dots, f_K^{(0)} \in \Phi$ randomly, or by
any heuristics that are believed to give a good start.

Step 2: Clustering Phase: In the r-th iteration, r = 1, 2, ...,
repartition the dataset as

$$Z_k^{(r)} = \{(x, y) \in Z \mid e(f_k^{(r-1)}(x), y) \le e(f_{k'}^{(r-1)}(x), y) \ \forall k' \ne k\}. \quad (5)$$

(A tie can be resolved randomly among the winners.)
Intuitively, each data point is associated with the
regression function that gives the smallest approximation
error on it. Algorithmically, a data point in $Z_k^{(r-1)}$ is
moved to $Z_{k'}^{(r)}$ iff $e(f_{k'}^{(r-1)}(x), y) < e(f_k^{(r-1)}(x), y)$
and $e(f_{k'}^{(r-1)}(x), y) \le e(f_{k''}^{(r-1)}(x), y)$ for all $k'' \ne k, k'$.
$Z_k^{(r)}$ gets all the data points in $Z_k^{(r-1)}$ that are not
moved.
moved. 

Step 3: Regression Phase: Run any regression optimization
algorithm that gives the following

$$f_k^{(r)} = \arg\min_{f \in \Phi} \sum_{(x_i, y_i) \in Z_k^{(r)}} e(f(x_i), y_i) \quad \text{for } k = 1, \dots, K. \quad (6)$$

(The regression algorithm is selected by the nature of the 
original problem or other criteria. RC adds no additional 
constraint on its selection.) 

Step 4: Stopping Rule: Run Step 2 and Step 3 repeatedly
until no data point changes its
membership.

Step 2 and Step 3 never increase the value of the
objective function in (2). If any data point changes its
membership in Step 2, the objective function is strictly
decreased. Since a finite dataset has only finitely many
partitions, the algorithm stops in a finite
number of iterations.
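
As a minimal sketch of Steps 1 to 4, assuming least-squares linear regression as the regression step and NumPy as the numerical library (both are illustrative choices, not part of the algorithm's definition), the loop can be written as follows. The names `rc_km` and `objective` are hypothetical, ties are resolved by lowest index rather than randomly, and clusters with fewer than D points simply keep their previous coefficients.

```python
import numpy as np

def rc_km(X, y, K, n_iter=100, seed=0):
    """RC-KM with least-squares linear regression as the regression step.

    X: (N, D) design matrix, y: (N,) responses. Returns (coeffs, labels).
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: random initial functions (here: random coefficient vectors).
    C = rng.normal(size=(K, D))
    labels = np.full(N, -1)
    for _ in range(n_iter):
        # Step 2: assign each point to the function with the smallest error.
        errors = (X @ C.T - y[:, None]) ** 2          # shape (N, K)
        new_labels = errors.argmin(axis=1)
        # Step 4: stop when no point changes membership.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: refit each function on its own subset.
        for k in range(K):
            mask = labels == k
            if mask.sum() >= D:                       # skip degenerate clusters
                C[k] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    return C, labels

def objective(X, y, C):
    """Sum over points of the minimum squared error over the K functions."""
    return ((X @ C.T - y[:, None]) ** 2).min(axis=1).sum()
```

Because the initialization depends only on `seed`, running with a larger `n_iter` continues the same trajectory, so `objective` evaluated on the later result is never larger than on the earlier one.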

4. MSE Linear Regression with K-Means 
Clustering -- LinReg-KM 

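With D functions $h_1(x), \ldots, h_D(x)$ chosen as the
basis, we consider the function class $\Phi = \{\sum_{l=1}^{D} c_l h_l(x) \mid c_l \in R\}$.
To simplify the notations, let $\bar{x} = (h_1(x), \ldots, h_D(x))$
and $X = [\bar{x}_i]_{N \times D}$. As an example, for the set of
two-variable (input dimension 2) polynomials up to degree 2, the D = 6 basis
functions are $h_1(x) = 1$, $h_2(x) = x_1$, $h_3(x) = x_2$, $h_4(x) = x_1^2$,
$h_5(x) = x_1 x_2$, $h_6(x) = x_2^2$. We have

$$X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,1}^2 & x_{1,1} x_{1,2} & x_{1,2}^2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{N,1} & x_{N,2} & x_{N,1}^2 & x_{N,1} x_{N,2} & x_{N,2}^2 \end{bmatrix}$$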

With the MSE $e(f(x), y) = \|f(x) - y\|^2$, LinReg-KM
minimizes the objective function

$$\sum_{k=1}^{K} \sum_{(x_i, y_i) \in Z_k} \|f_k(x_i) - y_i\|^2.$$

With the row-partition of Z into K subsets $Z_1, \ldots, Z_K$, the matrices
X and Y are row-partitioned accordingly, $X \to X_1, \ldots, X_K$
and $Y \to Y_1, \ldots, Y_K$. The coefficients of the optimal function
on the k-th subset are (Step 3 of RC-KM)

$$c_k = (X_k^T X_k)^{-1} X_k^T Y_k. \qquad (7)$$

The matrix of losses used for the comparisons in Step 2 of
RC-KM is

$$S = [e(f_k(x_i), y_i)]_{N \times K} = \mathrm{abs}(X [c_1, \ldots, c_K] - [Y, \ldots, Y]). \qquad (8)$$

(Squaring is monotone and not necessary.)
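
The degree-2 example, the per-subset solution (7) and the loss matrix can be sketched in NumPy as below. This is an illustration under assumptions: dim(Y) = 1, the loss matrix holds absolute residuals (in line with the remark that squaring is unnecessary), and the function names `quad_design`, `fit_subset` and `loss_matrix` are hypothetical.

```python
import numpy as np

def quad_design(x):
    """Map (N, 2) inputs to the basis 1, x1, x2, x1^2, x1*x2, x2^2."""
    x1, x2 = x[:, 0], x[:, 1]
    return np.column_stack([np.ones(len(x)), x1, x2, x1**2, x1 * x2, x2**2])

def fit_subset(Xk, Yk):
    """Least-squares coefficients on one subset via the normal equations."""
    return np.linalg.solve(Xk.T @ Xk, Xk.T @ Yk)

def loss_matrix(X, Y, C):
    """N x K matrix of absolute residuals; C has one column per function."""
    return np.abs(X @ C - Y[:, None])
```

Taking `loss_matrix(X, Y, C).argmin(axis=1)` then gives the repartition used in Step 2 of RC-KM.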

5. RC-K-Harmonic Means Algorithm (RC- 
KHM) 

The K-Means clustering algorithm is known to be
sensitive to the initialization of its centers. The same is
true for RC-KM. Convergence to a poor local optimum
has been observed quite frequently (see Fig. 5).

The K-Harmonic Means clustering algorithm showed
very strong insensitivity to initialization due to its
dynamic weighting of the data points (Zhang 2001,
2003). The regression clustering algorithm RC-KHMp is
presented in this section. It is shown experimentally that
it outperforms RC-KM and RC-EM.

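RC-KHMp's objective function is defined by
replacing the MIN() function in (4) by the harmonic
average HA(). The error function is $e(f_k(x_i), y_i) = \|f_k(x_i) - y_i\|^p$.

$$Perf_{RC\text{-}KHM_p}(Z, M) = \sum_{i=1}^{N} \mathrm{HA}_{1 \le k \le K}\{\|f_k(x_i) - y_i\|^p\} = \sum_{i=1}^{N} \frac{K}{\sum_{k=1}^{K} \dfrac{1}{\|f_k(x_i) - y_i\|^p}} \qquad (9)$$

An iterative algorithm (see Zhang 2001) is available for
finding a local optimum of (9).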
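
To illustrate how the harmonic average relates to the MIN() it replaces, the following sketch evaluates both objectives on an N x K array of residuals. It assumes dim(Y) = 1; the names `perf_khm` and `perf_km` and the small guard `eps` are assumptions made for this example only.

```python
import numpy as np

def perf_khm(residuals, p=2.0, eps=1e-12):
    """Sum over points of the harmonic average of |residual|^p across K functions.

    residuals: (N, K) array of f_k(x_i) - y_i.
    eps is an assumed guard against division by zero.
    """
    e = np.abs(residuals) ** p + eps
    K = e.shape[1]
    return (K / (1.0 / e).sum(axis=1)).sum()

def perf_km(residuals, p=2.0):
    """Same errors combined with MIN instead of the harmonic average."""
    return (np.abs(residuals) ** p).min(axis=1).sum()
```

For any set of K positive numbers the harmonic average lies between the minimum and K times the minimum, so the two objectives track each other while the harmonic version stays smooth in every residual.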

RC-KHM Algorithm: 

Step 1: Pick K functions $f_1^{(0)}, \ldots, f_K^{(0)} \in \Phi$ randomly.
Step 2: Clustering Phase: In the r-th iteration, let

$$d_{i,k} = \|f_k^{(r-1)}(x_i) - y_i\|. \qquad (10)$$

a) The hard partition $Z = \bigcup_{k=1}^{K} Z_k$, in RC-KM, is
replaced by a "soft" membership function: the i-th data
point is associated with the k-th regression function with
probability

$$p(Z_k \mid z_i) = [\ldots] \qquad (11)$$

The choice of q (>= 1) will put the regression's error
function in $L^q$-space. See (13). (This is more general
than the K-Harmonic Means clustering algorithm
presented before, which had q = 2.) For simpler
notations, we do not index $p(Z_k \mid z_i)$ and $a_p(z_i)$ in (12)
by q. Quantities $d_{i,k}$, $p(Z_k \mid z_i)$, and $a_p(z_i)$ should be
indexed by the iteration r, which is also dropped.
b) In RC-KHM, unlike in RC-KM, not all data points fully
participate in all iterations. Each data point's
participation is weighted by

$$a_p(z_i) = [\ldots] \qquad (12)$$

$a_p(z_i)$ is small if and only if $z_i$ is close to one of the
functions (i.e., done for it). The weighting function $a_p(z_i)$
changes in each iteration as the regression functions are
updated. If all functions drifted away from a point $z_i$ in
the last iteration, $a_p(z_i)$ goes up. More details on this
weighting function are in (Zhang 2001).

Step 3: Regression Phase: Run any regression
optimization algorithm that gives the following

$$f_k^{(r)} = \arg\min_{f \in \Phi} \sum_{i=1}^{N} a_p(z_i)\, p(Z_k \mid z_i)\, \|f(x_i) - y_i\|^q \quad \text{for } k = 1, \ldots, K. \qquad (13)$$

Step 4: Since there is no discrete membership change in
RC-KHM, the stopping rule is replaced by measuring the
changes to its objective function (9): when the change is
smaller than a threshold, the iteration is stopped.

6. Linear Regression with K-Harmonic 
Means Clustering — LinReg-KHM 

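For linear regression, we choose q = 2. Writing (13) in
matrix form, we have

$$c_k^{(r)} = \arg\min_{c} \ (Xc - Y)^T \, \mathrm{diag}_{1 \le i \le N}\big(a_p(z_i)\, p(Z_k \mid z_i)\big) \, (Xc - Y) \qquad (14)$$

and its solution is

$$c_k^{(r)} = \big(X^T \mathrm{diag}(a_p(z_i)\, p(Z_k \mid z_i))\, X\big)^{-1} X^T \mathrm{diag}(a_p(z_i)\, p(Z_k \mid z_i))\, Y, \qquad (15)$$

where $d_{i,k} = \|\bar{x}_i c_k^{(r-1)} - y_i\|$. ([...] is a matrix of size
$N \times D$ with entries being one of three possibilities: row
vectors, column vectors or scalars.) The inversion in
(15) is on a $D \times D$ matrix.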
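
A weighted least-squares problem of this form can be solved without ever forming the N x N diagonal matrix. The sketch below is generic: `w` stands for whatever per-point weight multiplies the squared residual (for LinReg-KHM, the product of the weighting function and the soft membership), and the name `weighted_lstsq` is hypothetical.

```python
import numpy as np

def weighted_lstsq(X, Y, w):
    """Solve min_c (Xc - Y)^T diag(w) (Xc - Y) via the D x D normal equations."""
    Xw = X * w[:, None]              # same as diag(w) @ X, without the N x N matrix
    return np.linalg.solve(X.T @ Xw, Xw.T @ Y)
```

With all weights equal to 1 this reduces to ordinary least squares, and with 0/1 weights it reduces to the per-subset regression (7) of LinReg-KM, which is one way to see the similarity among the three algorithms.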

7. The RC-EM Algorithm 

One of the applications of the general EM algorithm
(McLachlan and Krishnan [11]) is probability density
estimation or clustering. The best feature of the linear-mixing
Gaussian EM clustering algorithm is the natural
probability interpretation of its linear mixing
(superposition). We include a brief presentation of RC-EM
for comparing the performance of all three algorithms
in Section 11. The objective function for RC-EM is
defined as

$$Perf_{RC\text{-}EM}(Z, M) = -\log\left\{ \prod_{i=1}^{N} \sum_{k=1}^{K} \frac{p_k}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\left(-\tfrac{1}{2} (f_k(x_i) - y_i)\, \Sigma_k^{-1} (f_k(x_i) - y_i)^T\right) \right\} \qquad (16)$$

where d = dim(Y). In case d = 1, $(f_k(x_i) - y_i)$ is just a
real number and $\Sigma_k^{-1} = 1/\sigma_k^2$. In higher dimensions, a
restriction on the covariance matrix $\Sigma_k$ is necessary for
EM to work properly. $\Sigma_k$ = diagonal matrix is often
used.

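The RC-EM recursion is given by
E-Step:

$$p(Z_k^{(r)} \mid z_i) = \frac{\dfrac{p_k^{(r-1)}}{\sqrt{|\Sigma_{r-1,k}|}} \exp\left(-\tfrac{1}{2} (f_k^{(r-1)}(x_i) - y_i)\, \Sigma_{r-1,k}^{-1} (f_k^{(r-1)}(x_i) - y_i)^T\right)}{\sum_{l=1}^{K} \dfrac{p_l^{(r-1)}}{\sqrt{|\Sigma_{r-1,l}|}} \exp\left(-\tfrac{1}{2} (f_l^{(r-1)}(x_i) - y_i)\, \Sigma_{r-1,l}^{-1} (f_l^{(r-1)}(x_i) - y_i)^T\right)} \qquad (17)$$

M-Step:

$$p_k^{(r)} = \frac{1}{N} \sum_{i=1}^{N} p(Z_k^{(r)} \mid z_i) \qquad (18)$$

$$f_k^{(r)} = \arg\min_{f \in \Phi} \sum_{i=1}^{N} p(Z_k^{(r)} \mid z_i)\, \|f(x_i) - y_i\|^2 \qquad (19)$$

$$\Sigma_{r,k} = \frac{\sum_{i=1}^{N} p(Z_k^{(r)} \mid z_i)\, (f_k^{(r)}(x_i) - y_i)^T (f_k^{(r)}(x_i) - y_i)}{N \, p_k^{(r)}} \qquad (20)$$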
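
For dim(Y) = 1 the E-step and the scalar parts of the M-step take a compact form. The sketch below assumes the standard Gaussian-mixture responsibilities, with one variance per regression function; the names `e_step` and `m_step_scalars` are hypothetical, and the regression update itself is left to whichever regression routine is in use.

```python
import numpy as np

def e_step(residuals, p, sigma2):
    """Soft memberships for dim(Y) = 1.

    residuals: (N, K) array of f_k(x_i) - y_i; p: (K,) mixing weights;
    sigma2: (K,) variances. Returns an (N, K) array whose rows sum to 1.
    """
    # Work in logs for numerical stability, then normalize each row.
    log_w = np.log(p) - 0.5 * np.log(sigma2) - 0.5 * residuals**2 / sigma2
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

def m_step_scalars(residuals, post):
    """Mixing weights and variances, given residuals of the refitted functions."""
    N = residuals.shape[0]
    p = post.sum(axis=0) / N
    sigma2 = (post * residuals**2).sum(axis=0) / (N * p)
    return p, sigma2
```

A point that lies on one function and far from the others receives a membership close to 1 for that function, while a point equidistant from all functions is split according to the mixing weights and variances.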



8. MSE Linear Regression with EM 
Clustering — LinReg-EM 

When MSE linear regression is used, (19) can be
solved and takes the following special form, while all
other formulas (16)-(18) and (20) remain the same.

$$c_k^{(r)} = \big(X^T \mathrm{diag}(p(Z_k^{(r)} \mid z_i))\, X\big)^{-1} X^T \mathrm{diag}(p(Z_k^{(r)} \mid z_i))\, Y \qquad (21)$$

Very strong similarity between (21) and LinReg-KHM's
(15), or between (21) and LinReg-KM's (7), can be
observed.

9. Computational Costs for RCs with MSE 
Linear Regression 

We compare the cost of one iteration of RC with the
cost of single-function linear regression on the whole
dataset without clustering, for all three examples: LinReg-KM,
LinReg-KHM and LinReg-EM. This comparison
shows the cost ratio of switching from single-function
regression to RC.

The cost of forming X is common to both RC and
single linear regression. In single linear regression, the
cost of calculating $c = (X^T X)^{-1} X^T Y$ is the sum of
(a unit of calculation here is multiplying two numbers
and adding the result to another number): $D^2 N$ units
for forming $X^T X$, $D^2 + D N$ units for forming
$X^T Y$, and $\beta D^3$ for solving $(X^T X)\, c = X^T Y$, where
$\beta$ is a small constant. Here $D = m + 1$ if $\dim(X) = 1$, or
$D = [\ldots]/(\dim(X) - 1)$ for $\dim(X) \ge 2$. We need $N \ge D$; otherwise
the regression has infinitely many solutions. We assume that
$N \gg D$; otherwise the potential of overfitting (and/or
overshooting) is high. In any case the dominant term is
$O(D^2 N)$. Let $N_k$ be the size of the k-th cluster. The
costs of K regressions are $\sum_{k=1}^{K} D^2 N_k = D^2 N$ units for
all $X_k^T X_k$, k = 1, ..., K, $K D^2 + D N$ units for all
$X_k^T Y_k$, and $K \beta D^3$ for solving the K linear equations
$(X_k^T X_k)\, c_k = X_k^T Y_k$. K is very small and we do not
expect it ever to be large (say > 50). The repartition cost
for LinReg-KM is $O(D N K)$ due to the number of
error function evaluations and comparisons. Therefore,
the cost of each iteration of LinReg-KM is of the same
order of complexity as the simple single-function
regression.
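
The unit counts itemized above can be tallied directly. In this sketch the value of the constant `beta` and the unit constant factor on the repartition term are assumptions, and the function names are hypothetical; only the ratio's order of magnitude is meaningful.

```python
def cost_single(N, D, beta=1.0):
    """Units for one regression on the whole dataset, as itemized in the text."""
    return D**2 * N + (D**2 + D * N) + beta * D**3

def cost_linreg_km(N, D, K, beta=1.0):
    """Units for one LinReg-KM iteration: K regressions plus the repartition step."""
    regressions = D**2 * N + (K * D**2 + D * N) + K * beta * D**3
    repartition = D * N * K          # O(D*N*K); constant factor assumed to be 1
    return regressions + repartition

# Largest configuration of the experiments: D = 8, K = 9, N = 50*D*K = 3600.
ratio = cost_linreg_km(N=3600, D=8, K=9) / cost_single(N=3600, D=8)
```

Under these assumptions the ratio comes out at roughly 2, consistent with the statement that one iteration costs the same order as a single regression.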

We observed a quick convergence at start in all 
experiments but some of them had a long tail. (See 
Section 11.2) 

The cost of calculating the repartition probabilities in
LinReg-KHM and LinReg-EM is of the same order as
the repartition cost in LinReg-KM.

With input variable selection, not all the variables 
selected for the single function regression need to appear 
in the selected variables for each subset. Therefore, the 
dimensionality of the regression problem on each subset 
may become lower. 



10. Probability Interpretation of RC's K 
Regression Functions 

Regression results are most often used for
predictions: $y = f(x)$ is taken as a prediction of the
response at a new input $x$. With K regression functions
returned by RC, we get K predictions $\{f_k(x)\}_{k=1}^{K}$ on the
same input $x$, which is interpreted in this section.

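Assume that the dataset X is i.i.d. sampled from a hidden
density distribution P(). Kernel density estimation on the
K X-projections of $Z_k = \{p(Z_k \mid z) \mid z = (x, y) \in Z\}$
(for KHM and EM see (11) and (17); for KM they are the
real subsets) gives

$$P(x \mid X_k) = \frac{\frac{1}{N} \sum_{i=1}^{N} p(Z_k \mid z_i)\, H\!\left(\frac{x_i - x}{h}\right)}{P(X_k)} \qquad (22)$$

with

$$P(X_k) = \frac{1}{N} \sum_{i=1}^{N} p(Z_k \mid z_i). \qquad (23)$$

H() in (22) is a symmetric kernel and h the bandwidth
(see [13]). If we add the density estimation of each
subset, we get the kernel density estimation on the whole
dataset,

$$P(x) = \sum_{k=1}^{K} P(x \mid X_k)\, P(X_k) = \frac{1}{N} \sum_{i=1}^{N} H\!\left(\frac{x_i - x}{h}\right). \qquad (24)$$

Bayes' inversion gives the probability of x belonging to
each subset,

$$P(X_k \mid x) = \frac{P(x \mid X_k)\, P(X_k)}{P(x)} = \frac{\sum_{i=1}^{N} p(Z_k \mid z_i)\, H\!\left(\frac{x_i - x}{h}\right)}{\sum_{i=1}^{N} H\!\left(\frac{x_i - x}{h}\right)}. \qquad (25)$$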



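Let $\tilde{f}(x)$ be the random variable prediction which
equals $f_k(x)$ with probability $P(X_k \mid x)$. The
expected value of this prediction is estimated by

$$E(\tilde{f}(x) \mid x) = \sum_{k=1}^{K} f_k(x)\, P(X_k \mid x) = \frac{\sum_{i=1}^{N} H\!\left(\frac{x_i - x}{h}\right) \sum_{k=1}^{K} f_k(x)\, p(Z_k \mid z_i)}{\sum_{i=1}^{N} H\!\left(\frac{x_i - x}{h}\right)}. \qquad (26)$$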
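
The kernel-weighted subset probabilities and the expected prediction can be computed at a single query point as sketched below. The sketch assumes one-dimensional inputs and a Gaussian kernel (the text only requires a symmetric kernel); the function name and the default bandwidth are hypothetical.

```python
import numpy as np

def posterior_and_prediction(x_new, x_train, memberships, funcs, h=0.1):
    """Kernel-weighted P(X_k | x) and the expected prediction at one input x_new.

    x_train: (N,) inputs; memberships: (N, K) values p(Z_k | z_i);
    funcs: list of K callables f_k; h: bandwidth. A Gaussian kernel is assumed.
    """
    kern = np.exp(-0.5 * ((x_train - x_new) / h) ** 2)      # H((x_i - x) / h)
    post = (memberships * kern[:, None]).sum(axis=0) / kern.sum()
    preds = np.array([f(x_new) for f in funcs])
    return post, float(post @ preds)
```

When the memberships of each training point sum to 1, the returned probabilities also sum to 1, and a query point surrounded only by members of one subset is predicted almost entirely by that subset's function.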

A random variable contains more information than its
expectation; therefore, RC's prediction $\tilde{f}(x) \mid x$, a
random variable, gives more information than its
expectation $E(\tilde{f}(x) \mid x)$. Instead of giving a single-valued
prediction with a large uncertainty, $\tilde{f}(x) \mid x$
gives K possible values, each with a much smaller
uncertainty. The significant part of the uncertainty is
described by the probability distribution ($P(X_k \mid x)$,
k = 1, ..., K).

A classifier, k = C(x), can be trained using the labels
provided by the clustering phase of the RC algorithm. In
case the false classification rate of C is low, which may
not be true for some datasets, a prediction on x can be
$f_{C(x)}(x)$.

11. Experimental Results

We conducted two sets of experiments: Set 1 for
visualization of RC, and Set 2 for statistical comparisons
of LinReg-KM, LinReg-KHM and LinReg-EM.

11.1. Visualization Experiments 

This section visually demonstrates RC. Statistical 
performance analysis and comparison of different 
variations of RCs are in the next section. 

The dimensionality of X is 1, so that a 2-dimensional
visualization can be presented. Linear regression RC is
already demonstrated in Fig. 1b. We do both quadratic
(Fig. 2) and trigonometric (Fig. 3) regressions in this
section.



Fig. 2. N=600, D=1, K=3. On the left is the result
of simple quadratic regression on the whole
dataset. On the right is LinReg-KHM.





Fig. 3. N=1200, D=1, K=3. $\Phi = \{a_1 \sin(6x) + a_2 x \mid a_i \in R\}$
and the data set is a mixture of three
subsets generated by three functions in $\Phi$ with
added Gaussian noise. Left: one regression
function is applied to the whole dataset. Right:
three regression functions are used. Each of
them found a very good approximation of the
original functions used to generate the dataset.



Fig. 4. A local optimum. It happens to all three 
RC algorithms, RC-KM, RC-KHM, and RC-EM. 




Ploy-KHM, with the version of KHM presented in Zhang et
al. [22], which is better for one- and two-dimensional
spaces, is used in this section.

A local optimum is shown in Fig. 4. This tells us
how the algorithms may fail to reach the global optimum.
Knowing this helps to correct it manually, by providing a
special initialization after seeing a suspected result.

11.2 Statistical Comparisons of LinReg-KM, 
LinReg-KHM and LinReg-EM 

Twelve sets of experiments, with D = 2, 4, 6, 8
and K = 3, 6, 9, are conducted. In each set, 60
datasets with N = 50*D*K are generated by
randomly picking N points on K randomly generated
hyperplanes and then adding Gaussian noise to the
y-components. The regression functions are linear
(hyperplanes). For each dataset, a common
initialization of the regression functions is used for
all three algorithms.
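
One way to generate a dataset of this kind is sketched below. Several details are assumptions made for the example because the text does not specify them: D is taken as the input dimension with an added intercept column, inputs are uniform on [-1, 1], hyperplane coefficients are standard normal, and the noise level is 0.05. The name `make_dataset` is hypothetical.

```python
import numpy as np

def make_dataset(D, K, noise=0.05, seed=0):
    """N = 50*D*K points on K random hyperplanes, plus Gaussian noise on y."""
    rng = np.random.default_rng(seed)
    N = 50 * D * K
    x = rng.uniform(-1.0, 1.0, size=(N, D))       # input range is an assumption
    X = np.column_stack([np.ones(N), x])          # intercept + D inputs
    C = rng.normal(size=(K, D + 1))               # one random hyperplane per cluster
    labels = rng.integers(0, K, size=N)           # hyperplane each point lies on
    y = np.einsum("nd,nd->n", X, C[labels]) + noise * rng.normal(size=N)
    return X, y, labels, C
```

Keeping the true `labels` makes it possible to run a regression on each known subset afterwards, which is how a baseline close to the global optimum can be obtained.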

To make direct comparisons of the three algorithms
possible, we have to measure them by a common
performance measure, which is chosen to be
LinReg-KM's objective function in (2). After
LinReg-KHM and LinReg-EM have converged, we
discard each algorithm's own performance measure and
re-measure its result by LinReg-KM's. Doing so is
slightly in favor of LinReg-KM. We use the
notations $Perf_{KHM|KM}$ and $Perf_{EM|KM}$ for these
re-measurements.

Taking advantage of the known partitions of the
synthetic datasets, we calculated a $Perf_{baseline}$ by
running regression on each of the K subsets and adding
the results up, for comparing against the performance of
LinReg-KM and LinReg-KHM. $Perf_{baseline}$ is close
to the global optimum.

The results are in Fig. 5. Each curve has 60
points from the 60 runs of RC, without interpolation.
There are four curves in each plot, which are frequency
estimations of the accumulative distributions in (22)-(25),
with the v-axis horizontal and the prob-axis vertical:

$$\Pr\!\left(\frac{Perf_{KHM|KM}}{Perf_{EM|KM}} \le v\right), \quad \Pr\!\left(\frac{Perf_{KHM|KM}}{Perf_{baseline}} \le v\right), \quad \Pr\!\left(\frac{Perf_{KM}}{Perf_{baseline}} \le v\right), \quad \Pr\!\left(\frac{Perf_{EM|KM}}{Perf_{baseline}} \le v\right). \qquad (22\text{-}25)$$

Fig. 5. The accumulative distribution of the performance
ratios. Icons and the text in each plot:
black squares: LinReg-KHM over LinReg-EM;
blue O's: LinReg-KHM over the baseline; red
(+)'s: LinReg-KM over the baseline; and green
triangles: LinReg-EM over the baseline. m1 =
mean of the ratios of LinReg-KHM over LinReg-EM,
m2 = mean of the ratios of LinReg-KHM over
the baseline, m3 = mean of the ratios of LinReg-KM
over the baseline, and m4 = mean of the
ratios of LinReg-EM over the baseline.

The plot of (22), in black squares, shows how often
LinReg-KHM performed better than LinReg-EM,
with equal performance when the ratio is 1.

The plot of (23), in blue O's, shows how well
LinReg-KHM performed against the $Perf_{baseline}$,
which should be very close to the true optimum.
When the value is close to 1, a very good
approximation of the global optimum was found.

The plots of (24), in red (+)'s, and (25), in green
triangles, show how well LinReg-KM and LinReg-EM
performed against the $Perf_{baseline}$.

We truncated the x-axis to make the interesting 
part of the plot (near 1) more readable. 

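In addition to the plotted distributions in (22)-(25),
the expectation is also given on each plot:

$$m1 = E\!\left(\frac{Perf_{KHM|KM}}{Perf_{EM|KM}}\right), \quad m2 = E\!\left(\frac{Perf_{KHM|KM}}{Perf_{baseline}}\right), \quad m3 = E\!\left(\frac{Perf_{KM}}{Perf_{baseline}}\right), \quad m4 = E\!\left(\frac{Perf_{EM|KM}}{Perf_{baseline}}\right). \qquad (26)$$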
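
A frequency estimate of such an accumulative distribution, with one point per run and no interpolation, can be computed as in the sketch below. The ratios shown are made-up numbers for illustration only, not results from the experiments, and the name `empirical_cdf` is hypothetical.

```python
import numpy as np

def empirical_cdf(ratios):
    """Return sorted ratios v and, for each, the fraction of runs with ratio <= v."""
    v = np.sort(np.asarray(ratios, dtype=float))
    prob = np.searchsorted(v, v, side="right") / len(v)
    return v, prob

ratios = [1.0, 1.2, 1.0, 3.5, 1.1]       # made-up numbers for illustration only
v, prob = empirical_cdf(ratios)
m = float(np.mean(ratios))                # the mean reported on each plot
```

Plotting `prob` against `v` gives one curve of the kind described; the mean `m` is sensitive to a few very poor runs, which is why the full distribution is shown alongside it.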



Observations: A) Except for K=3 and D=2, LinReg-KHM
performed the best among the three. As K and
D increase, the performance gaps become larger; B)
LinReg-EM performed better than LinReg-KM on
average for all K and D. This is due to the low
dimensionality of the Y-space (dim(Y) = 1), where the
clustering algorithms are applied; C) In my previous
comparisons on the performance of center-based
clustering algorithms (Zhang 2003), K-Means
performed better than EM on average on datasets
with dimensionality > 1. The higher the
dimensionality of the data, the more K-Means
outperforms EM.

12. Conclusions 

Clustering recovers a discrete estimation of the 
missing part of the responses and provides each 
regression function with the right subset of data. A 
new regression clustering algorithm RC-KHM is 
presented. LinReg-KHM outperforms both LinReg- 
EM and LinReg-KM. 

In the general form of RCs, the regression part
of the algorithm is completely general; no
requirement is added to it by the RC algorithm.
This implies that a) RC algorithms work with any
type of regression; b) RC can be built on top of
existing regression libraries and call the existing
regression program as a subroutine.

We give two other advantages of using RC. 
Regression helps with understanding the data by 
replacing it with an analytical function plus a residue 
noise. When the noise is small, the function 
describes the data well. RC does a much better job 
on this. The compact representation of data by a 
regression function can also be considered as (or 
part of) data compression. With a significantly 
smaller mean residue noise, RC does a much better 
job on this too. 

EM's linear mixing of simple distributions has
the most natural probability interpretation. To
benefit from both EM's probability model and
the KHM algorithm's robust convergence, we
recommend running RC-KHM first and using its
converged results to initialize RC-EM. RC-KHM
does not supply the initial values for $p_k^{(r)}$ and $\Sigma_{r,k}$.
To solve this problem, keep the initial function-centers
fixed at RC-KHM's output for a number
of iterations to let the probabilities $p_k^{(r)}$ and $\Sigma_{r,k}$
converge under RC-EM before setting the function-centers
free.